Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

Template

spark = (
    SparkSession.builder
    .master("local")                                   # run Spark locally in a single JVM
    .appName("Exploring Joins")                        # name shown in the Spark UI
    .config("spark.some.config.option", "some-value")  # placeholder for any extra config
    .getOrCreate()                                     # reuse an existing session if one exists
)

sc = spark.sparkContext

Create a DataFrame

schema = T.StructType([
    # each field: column name, data type, nullable?
    T.StructField("pet_id", T.IntegerType(), False),
    T.StructField("name", T.StringType(), True),
    T.StructField("age", T.IntegerType(), True),
])

data = [
    (1, "Bear", 13), 
    (2, "Chewie", 12), 
    (2, "Roger", 1), 
]

pet_df = spark.createDataFrame(
    data=data,
    schema=schema
)

pet_df.toPandas()
   pet_id    name  age
0       1    Bear   13
1       2  Chewie   12
2       2   Roger    1
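
One thing worth noting: toPandas() collects the entire DataFrame onto the driver, which is fine for a tiny DataFrame like this one but not for real datasets. show() prints a small sample without collecting everything:

# print the first rows without pulling the whole DataFrame onto the driver
pet_df.show()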

Background

There are 3 core data structures in Spark: the RDD, the DataFrame, and the Dataset. As mentioned before, we will focus on the DataFrame.

  • The DataFrame is the most performant and most commonly used of the three.
  • RDDs are largely a thing of the past; refrain from using them unless a transformation can't be expressed with DataFrames.
  • Datasets only exist in the Scala and Java APIs, so they won't come up here.

If you have used a DataFrame in Pandas, this is the same idea. If you haven't, a DataFrame is similar to a CSV or Excel file: there are columns and rows that you can perform transformations on. You can search online for more thorough descriptions of what a DataFrame is.
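
As a small, hedged illustration of those transformations, here is a sketch that filters rows and selects columns on the pet_df created above (the age threshold and the chosen columns are arbitrary):

from pyspark.sql import functions as F

# keep pets older than 5 years and select two of the columns
older_pets_df = (
    pet_df
    .filter(F.col("age") > 5)
    .select("name", "age")
)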

What Happened?

For any DataFrame (df) that you work with in Spark you should provide it with 2 things:

  1. a schema for the data. Providing the schema explicitly makes the code clearer to the reader, and it can also be more performant: Spark doesn't have to infer the types, and knowing whether a column is nullable can help it plan the work. Providing a schema means providing 3 things for each column:
    • the name of the column
    • the datatype of the column
    • the nullability of the column
  2. the data. Normally you would read data stored in GCS, S3, etc. into a df, but there will be the odd time when you need to create one by hand, as we did above (see the sketch after this list for the more common read path).
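
To tie the two points together, here is a minimal sketch: printSchema() confirms the column names, types, and nullability declared above, and the spark.read call shows the more common read-from-storage path (the bucket path is a hypothetical placeholder, not a real file):

# confirm the name, type, and nullability of each column
pet_df.printSchema()

# the more common case: read data from storage with an explicit schema
# (the path below is a made-up placeholder)
pets_from_storage_df = (
    spark.read
    .schema(schema)
    .csv("gs://some-bucket/pets.csv", header=True)
)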
